Module 6: Data Visualization

Jacob Jameson
Summer 2021

Data Visualization

Objectives

  1. Be able to plot in ggplot
  2. Know what types of figures you want
  3. Use plots to gain information

Key points

  1. ggplot syntax
  2. aesthetics mapping
  3. geometry
  4. labels

What is ggplot?

  • ggplot (the ggplot2 package, written by Hadley Wickham) is an implementation of Leland Wilkinson's Grammar of Graphics for data visualization

    • The Grammar of Graphics is an abstraction of graphics ideas that aims to “shorten the distance from mind to page”
    • ggplot is a data visualization package that is part of the tidyverse suite of packages
  • To use ggplot functions, first load tidyverse

library(tidyverse)

Basic Components of ggplot (Layers)

  • Layer 1: Background layer (ggplot())
    • A data frame
    • Aesthetic mapping: how data are mapped to the x-axis, y-axis, color, size, etc.
  • Layer 2: Geometry layer (geom_xxx())
    • geometric objects like points, lines, shapes
  • Layer 3: Labels layer (labs())
    • title, axis labels, legend, etc.
  • Others…
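
As a rough sketch of how these layers stack, the snippet below builds one plot step by step. The toy data frame is made up for illustration (it only borrows the bmi and charges column names from the insurance data), and the object names p1 to p3 are arbitrary.

```r
library(ggplot2)

# Hypothetical toy data (made-up values, for illustration only)
toy <- data.frame(
  bmi     = c(22.7, 27.9, 30.1, 33.0, 36.4),
  charges = c(3500, 5200, 8600, 11000, 14500)
)

# Layer 1: background layer (data + aesthetic mapping); draws the axes only
p1 <- ggplot(data = toy, mapping = aes(x = bmi, y = charges))

# Layer 2: geometry layer; adds one point per row
p2 <- p1 + geom_point()

# Layer 3: labels layer; adds a title and axis labels
p3 <- p2 + labs(title = "BMI vs. Charges", x = "BMI", y = "Charges")

# Typing an object's name (for example p3) at the console draws that plot
```

Drawing p1, p2, and p3 in turn shows what each layer contributes.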

Simplest ggplot Code Structure

In ggplot, the simplest code structure for a plot can often be summarized as:

ggplot(data = [dataset],
       mapping = aes(x = [x-variable],
                     y = [y-variable],
                     ...)) +
  geom_xxx() +
  other options

BMI vs. Charges

insurance = read.csv('insurance.csv')
ggplot(data = insurance, 
       mapping = aes(x = bmi, 
                     y = charges)) +
  geom_point()

How would you describe this relationship? What other variables would help us understand data points that don't follow the overall trend?

Aesthetic Mapping

What Are Aesthetic Mappings?

  • To display values, map variables in the data to visual properties of the geom (aesthetics)
  • An aesthetic is a visual property of the objects in your plot, such as size, shape, color, or x and y location
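
One way to see what "mapping" means is to contrast it with setting a fixed value. The sketch below uses a made-up toy data frame (only the column names come from the insurance data), and the object names are arbitrary.

```r
library(ggplot2)

# Hypothetical toy data (made-up values, for illustration only)
toy <- data.frame(
  bmi     = c(22.7, 27.9, 30.1, 33.0, 36.4, 25.8),
  charges = c(15000, 4400, 34000, 8600, 41000, 5200),
  smoker  = c("yes", "no", "yes", "no", "yes", "no")
)

# Mapping: color depends on a variable, so it goes inside aes()
p_mapped <- ggplot(data = toy,
                   mapping = aes(x = bmi, y = charges, color = smoker)) +
  geom_point()

# Setting: one fixed color for every point goes outside aes()
p_fixed <- ggplot(data = toy,
                  mapping = aes(x = bmi, y = charges)) +
  geom_point(color = "blue")
```

In p_mapped, ggplot picks one color per smoker value and adds a legend. In p_fixed, every point is blue and no legend is needed.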

bmi vs. charges + smoker + age

[Figure: scatter plot of bmi vs. charges, with color mapped to smoker and point size mapped to age]

Adding Labels

ggplot(data=insurance, 
       mapping=aes(x=bmi, 
                   y=charges, 
                   color=smoker,
                   size=age)) +
  geom_point() +
  labs(title="BMI vs. Charges", x= "BMI", y="Charges")

Geometry: Type of the Figures

Questions:

  • What types of figures can ggplot plot?
  • How should we choose the type of figures?
  • How do we read different types of figures?

How to Choose Type of Figures?

  • Number of variables
  • Type of variables

Number of variables involved

  • Univariate data analysis

    • distribution of single variable
    • bar plot, histogram, density plot, etc
  • Bivariate data analysis

    • relationship between two variables
    • scatter plot, line plot, boxplot, (segmented) bar plot, etc
  • Multivariate data analysis

    • relationship between many variables at once, usually focusing on the relationship between two while conditioning on the others
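
As one possible illustration of conditioning on a third variable, the sketch below splits a scatter plot into one panel per smoker value with facet_wrap(). This function is not used elsewhere in these slides, and the toy data are made up; only the column names come from the insurance data.

```r
library(ggplot2)

# Hypothetical toy data (made-up values, for illustration only)
toy <- data.frame(
  bmi     = c(22.7, 27.9, 30.1, 33.0, 24.3, 29.5, 31.8, 36.4),
  charges = c(15000, 18500, 34000, 41000, 3200, 4100, 5600, 7300),
  smoker  = rep(c("yes", "no"), each = 4)
)

# bmi vs. charges, drawn separately for each value of smoker
p <- ggplot(data = toy, mapping = aes(x = bmi, y = charges)) +
  geom_point() +
  facet_wrap(~ smoker)
```

Mapping the third variable to color, as in aes(color = smoker), is another way to condition on it within a single panel.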

Types of variables

  • Numerical variables

    • continuous: BMI
    • discrete: age
    • Some typical plot types include scatter plot, histogram, box plot, density plot
  • Categorical variables

    • ordinal: education (high school, some college, college degree)
    • non-ordinal (nominal): gender
    • Some typical plot types include bar plots and ordered bar plots
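
A minimal sketch of how these variable types might be represented in R follows. The data frame is hypothetical: its values are made up, and the education and gender columns mirror the examples in the list above rather than columns known to be in insurance.csv.

```r
library(ggplot2)

# Hypothetical data (made-up values, for illustration only)
toy <- data.frame(
  bmi       = c(22.7, 27.9, 30.1, 33.0, 36.4),   # numerical, continuous
  age       = c(19L, 28L, 33L, 45L, 52L),        # numerical, discrete
  education = c("college degree", "high school", "some college",
                "high school", "college degree"),
  gender    = c("female", "male", "female", "male", "female")
)

# Ordinal: an ordered factor records the order of the categories
toy$education <- factor(toy$education,
                        levels = c("high school", "some college",
                                   "college degree"),
                        ordered = TRUE)

# Non-ordinal: a plain factor has categories but no order
toy$gender <- factor(toy$gender)

str(toy)  # shows the type R assigned to each column

# Bars follow the factor's level order, giving an ordered bar plot
p <- ggplot(data = toy, mapping = aes(x = education)) +
  geom_bar()
```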

Visualizing Univariate Data

  • numerical

    • histogram, density plot
    • geom_histogram(), geom_density()
  • categorical

    • bar plot
    • geom_bar()

Histograms

ggplot(data = insurance, mapping = aes(x = bmi)) +
  geom_histogram(binwidth = 1)
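
The binwidth argument controls how wide each bar is, and the choice can change how the distribution looks. The comparison below is only a sketch on made-up BMI values; the widths 1 and 5 are arbitrary choices for illustration.

```r
library(ggplot2)

# Made-up BMI values, for illustration only
toy <- data.frame(bmi = c(18.5, 21.2, 22.7, 24.9, 25.8, 27.9, 28.3,
                          30.1, 31.4, 33.0, 34.6, 36.4, 41.2))

# Narrow bins (1 BMI unit each): more detail, but noisier
p_narrow <- ggplot(data = toy, mapping = aes(x = bmi)) +
  geom_histogram(binwidth = 1)

# Wide bins (5 BMI units each): smoother, but hides detail
p_wide <- ggplot(data = toy, mapping = aes(x = bmi)) +
  geom_histogram(binwidth = 5)
```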

Density Plots

ggplot(data = insurance, mapping = aes(x = bmi)) +
  geom_density()

Describing shapes of numerical distributions

  • modality: unimodal, bimodal, multimodal, uniform
  • skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)
  • center: mean (mean()), median (median()), mode (not always useful)
  • spread: range (range()), standard deviation (sd())
  • unusual observations
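
The center and spread measures listed above can be computed with base R functions. The sketch below uses a short vector of made-up BMI values; with the real data you would pass a column such as insurance$bmi instead.

```r
# Made-up BMI values, for illustration only
bmi <- c(22.7, 25.8, 27.9, 30.1, 33.0, 36.4)

# Center
bmi_mean   <- mean(bmi)
bmi_median <- median(bmi)

# Spread
bmi_range <- range(bmi)  # returns c(minimum, maximum)
bmi_sd    <- sd(bmi)     # sample standard deviation

print(c(mean = bmi_mean, median = bmi_median, sd = bmi_sd))
print(bmi_range)
```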

Visualizing Univariate Categorical Data: Bar Plots

ggplot(data = insurance, mapping = aes(x = region)) +
  geom_bar()

Visualizing Bivariate Data

  • num. vs num.

    • scatter plot, line plot
    • geom_point(), geom_smooth()
  • num. vs cat.

    • box plot, bar plot
    • geom_boxplot(), geom_bar()
  • cat. vs cat.

    • (segmented) bar plot
    • geom_bar()

Num. vs Num.: Scatterplot

ggplot(data=insurance, 
       mapping=aes(x=bmi,
                   y=charges)) + 
  geom_point(size=3)

Num. vs Num.: Smooth Line Plot

ggplot(data=insurance, 
       mapping=aes(x=bmi,
                   y=charges))+
  geom_point(size=3)+
  geom_smooth(se = FALSE)  # se = FALSE hides the confidence band
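
By default geom_smooth() chooses its own smoothing method. As a hedged variation, the sketch below asks for a straight-line fit with method = "lm" and keeps the confidence band with se = TRUE. The toy data are made up, and only the column names come from the insurance data.

```r
library(ggplot2)

# Hypothetical toy data (made-up values, for illustration only)
toy <- data.frame(
  bmi     = c(22.7, 25.8, 27.9, 30.1, 33.0, 36.4),
  charges = c(3500, 6100, 5200, 8600, 11000, 14500)
)

# Scatter plot with a fitted straight line and its confidence band
p <- ggplot(data = toy, mapping = aes(x = bmi, y = charges)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = TRUE)
```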

Num. vs Cat.: Box (and Whisker) Plots

ggplot(data = insurance, mapping = aes(y = bmi, x = region)) +
  geom_boxplot()

Cat. vs Cat.: Segmented bar plots (counts)

ggplot(data = insurance, mapping = aes(x = region, fill = smoker)) +
  geom_bar()
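
The bars above show counts. To compare proportions across regions instead, one option is position = "fill", which rescales every bar to a height of 1. The sketch below uses made-up toy data; only the column names come from the insurance data.

```r
library(ggplot2)

# Hypothetical toy data (made-up values, for illustration only)
toy <- data.frame(
  region = c("northeast", "northeast", "northeast",
             "southeast", "southeast", "southeast", "southeast",
             "southwest", "southwest"),
  smoker = c("yes", "no", "no",
             "yes", "yes", "no", "no",
             "no", "no")
)

# Segmented bar plot of proportions: each bar is rescaled to sum to 1
p_prop <- ggplot(data = toy, mapping = aes(x = region, fill = smoker)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion")
```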

Recap

  • Visualizing our data can reveal relationships between variables that are hard to see in the raw numbers
  • ggplot2 is an R package (part of the tidyverse) for making plots; ggplot() is its main function
  • There are many ways you can visualize your data!